Frequency in Morphology
Not recorded
Abstract
0 Introduction
The recent work in statistical parsing (Church 1988, Schabes 1991) and statistical machine translation (Brown et al. 1990) calls the traditional rule-based view of grammar into question. These authors emphasize that grammatical rule systems aiming at syntax-directed translation, and even rule systems aimed at the description of a single language, break down when faced with the actual complexity of natural language data. In fact, under realistic testing conditions the "example-based" or "corpus-based" systems that employ some general-purpose optimization algorithm in order to extract statistical regularities from the data fare just as well as the rule-based systems in which the regularities are extracted beforehand by the grammarian. In the light of these facts it is natural to extend the inquiry to morphology and ask how statistical morphological systems that exploit the frequency information in the data will compare with rule-based morphological systems that exploit the expertise of the grammarian. The paper reports the results of a pilot study performed on the largest extant machine-readable corpus of a morphologically complex language, namely the SZÓTÁR corpus (see Kornai 1986), based on the Debrecen Thesaurus (Papp 1969) and the Frequency Dictionary of Hungarian (Füredi and Kelemen 1989). Section 1 presents the necessary background information about statistical approaches to grammar in general, about the traditional position-class view of morphology and its modern generalizations (Kiparsky 1982, Koskenniemi 1983), and about the Hungarian nominal and verbal paradigms (Antal 1961, 1966). Section 2 presents a simple statistical method to estimate the number of inflected word tokens required for "saturated paradigms", i.e. stems for which all paradigmatic forms are actually attested in the corpus.
Although the SZÓTÁR corpus is too small to contain fully saturated paradigms, using an empirical finding of the pilot study, namely that linguistically orthogonal morphosyntactic features are statistically independent, we can still estimate the required corpus size. In order to saturate the paradigms of the most frequent 1,000 verb stems we need to collect a corpus 630 to 800 times bigger than the present one, which is 0.5 million words, and in order to saturate the paradigms of the 30 most frequent nouns we would need over 33 billion words. Do we really need all the paradigmatic forms to be exemplified in our corpus to be able to build them into our statistical morphology? The concluding Section 3 proposes hybrid Hidden Markov Model (HMM) systems that bring both grammatical …
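The role of the independence finding in the corpus-size estimate can be illustrated with a minimal sketch. It is not the method of the paper, whose details are not given in this abstract. It only assumes that, if orthogonal features are independent, the probability of a paradigm slot is the product of the probabilities of its feature values, and that a slot counts as attested once its expected count reaches a threshold. The function names, the feature distributions, and the stem frequency below are all hypothetical.

```python
from itertools import product
from math import prod


def slot_probabilities(features):
    """Probability of every paradigm slot, assuming independent features.

    `features` maps a feature name to a dict {value: probability}.
    Each slot is one combination of feature values; under independence
    its probability is the product of the value probabilities.
    """
    names = list(features)
    slots = {}
    for combo in product(*(features[name].items() for name in names)):
        labels = tuple(value for value, _ in combo)
        slots[labels] = prod(p for _, p in combo)
    return slots


def tokens_to_saturate(stem_rel_freq, features, min_count=1):
    """Rough corpus size at which the rarest slot of a stem is expected
    to occur `min_count` times (an expectation-based approximation)."""
    rarest = min(slot_probabilities(features).values())
    return min_count / (stem_rel_freq * rarest)


# Hypothetical toy distributions, not Hungarian data.
toy_features = {
    "number": {"sg": 0.8, "pl": 0.2},
    "case": {"nom": 0.5, "acc": 0.3, "dat": 0.2},
}

if __name__ == "__main__":
    # A stem that makes up 0.1% of all tokens (hypothetical figure).
    print(round(tokens_to_saturate(0.001, toy_features)))
```

Because the estimate is driven by the rarest slot of the least frequent stem, it grows quickly with the number of orthogonal features, which is consistent with the much larger figure the abstract reports for nouns than for verbs.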
Similar resources
Correlation Between the Acoustic and Cell Morphology of Polyurethane/Silica Nanocomposite Foams: Effect of Various Proportions of Silica at Low Frequency Region
Introduction: Reducing noise pollution has become an essential issue due to the increase in public concern and also social demands for a better lifestyle. Using sound absorption materials is a preferred method to reduce the noise pollution. Undesirable properties of pure polyurethane such as poor absorption of mechanical energy in narrow frequency ranges can be improved by providing polymeric n...
Spatio-Temporal, Mineralogy and Micro-Morphology of Dust Occurrences and Centers with Internal Sources in the Khouzestan Province
Extended abstract 1- Introduction Dust occurrences as natural events are common in arid, semi-arid and desert areas. Investigation of the dust with internal sources in the Khuzestan province including about 15 percent of the dust events coming to the region and the presence of the annual average of 50 times of the internal dust (with the concentration maximum of PM10 particles more than 8000p...
Effects of combined magnetic fields on human sperm parameters
Background: In previous investigations, it has been clarified that electromagnetic fields (ELF) can cause some changes in cellular behavior. The aim of this prospective study was to investigate the effect of magnetic field (MF) on human sperm parameters of motility, morphology, and viability. Materials and Methods: Semen samples were collected from 12 fertile men, and were allowed to l...
P-23: Is There An Association between HOST Grades and Sperm Quality
Background: Intracytoplasmic sperm injection (ICSI) can be considered the solstice in vitro insemination technique because it has powerfully allowed the treatment of male factor infertility, but in this procedure, except visual morphological selection, there is no standardization for sperm selection. Recently, hypo osmotic swelling test (HOST) has been proposed to have the potential to select i...
P-95: Occupational Hazards and Male Infertility
Background: Infertility can be a major concern for couples trying to conceive and occupational hazards can be a main cause in infertility in men. Studies conducted throughout the world indicate that workplace physical and chemical hazards could have a negative impact on male fertility. The main object of this study was to determine the frequency of occupational categories of men who attended an...
Frequency Effects of Regular Past Tense Forms in English on Native Speakers’ and Second Language Learners’ Accuracy Rate and Reaction Time
There is substantial debate over the mental representation of regular past tense forms in both first language (L1) and second language (L2) processing. Specifically, the controversy revolves around the nature of morphologically complex forms such as the past tense –ed in English and how morphological structures of such forms are represented in the mental lexicon. This study focuses on the resul...